Page 1
Table of Contents
Cloud With Raj
www.cloudwithraj.com
- AWS Services for Interviews
- Microservice with ALB
- Event Driven Architecture (Basic)
- Microservice Vs Event Driven
- Event Driven Architectures (Advanced)
- Three-Tier Architecture
- Availability Zone & Data Center
- Lambda Cheaper Than EC2?
- RPO Vs RTO
- IP Address Vs URL
- DevOps Phases
- CI Vs CD Vs CD
- Kubernetes Tools Landscape
- Traditional CICD Vs GitOps
- Platform Vs Developer Team
- Scaling EC2 Vs Lambda Vs EKS
- Kubernetes (EKS) Scaling
- Gen AI Layers
- Prompt Engineering Vs RAG Vs Fine Tuning
- RAG with Bedrock
- EKS Upgrade With Karpenter
- EKS Upgrade With Karpenter - Advanced
- Container Lifecycle - Local to Cloud
- Gen AI Multi Model Invocation
- Git Workflow
- DevSecOps Workflow
- What Happens When You Type a URL
- AWS Well Architected Framework
- AWS Migration Tools
- Multi-site Active Active DR
- Kubernetes Node Pod Container Relationship
- Running Batch Workloads on AWS
- STAR Interview Format
- Hybrid Cloud Architecture
- How Karpenter Saves Money
- API Gateway Auth
- Serverless Web Application
- EDA with Kubernetes
- EDA with SNS, SQS, Kubernetes
- EKS Auto (Re:Invent 2024)
- Aurora DSQL (Re:Invent 2024)
- S3 Tables (Re:Invent 2024)
- Docker Vs Kubernetes
- System Design Trade-Offs
- Top 3 Popular Design
- Three Tier with Microservice
- Pod Container Sidecar
- Karpenter Bin Pack Granular Control
- Monolith Vs Microservice
- EventBridge Cross Account
- Kubernetes Tech Stack
- Lambda Transform
- Microservice Tech Stack
- Live Streaming
- Live Streaming with Ads
More To Be Added...
AWS Cloud Interview Guide
Raj's Bio
- Principal SA at (6+ Years)
- 20+ Years of IT experience
- Designed and implemented multiple world-scale
projects with official AWS blogs on them
- Trained students to get SA jobs via SA Bootcamp
- Bestselling author (60,000+ paid students)
- Presented highly-rated talks at major events
- LinkedIn Top Systems Design Voice
Lighthouse projects designed: Dr. B. Covid Vaccine Registration, Collins Flight Systems used by 100+ airlines, Freddie Mac Datacenter to AWS Migration, and more
Please follow me on socials:
Page 2
Important AWS Services for Interviews
- Compute: EC2, Auto Scaling, Lambda, Elastic Kubernetes Service (EKS), ECS
- Storage & Database: S3, EBS, RDS, DynamoDB, ElastiCache
- Network: VPC, Load Balancer, API Gateway
- AWS Global Infrastructure: Regions, Availability Zones
- Security: KMS, IAM, WAF, Shield, GuardDuty, Secrets Manager, Config
- Migration: DMS, Migration Hub, Application Migration Service, Application Discovery Service
- Gen AI: Bedrock, Q, PartyRock, SageMaker
- Event Driven: SNS, SQS, EventBridge, Step Functions
- Observability: CloudWatch, CloudTrail, X-Ray
- Cost Optimization: Compute Optimizer, CloudWatch Insights, Cost Explorer, Budgets, Spot Instances, Reserved Instances, Reporting, Savings Plans
- Analytics: Glue, EMR, Athena, QuickSight, Kinesis
- DevOps: CloudFormation
Page 3
Microservices with ALB
Domain cloudwithraj.com points to an ALB, which uses path-based routing to send each request to a different target group:
- /browse → Target Group 1: Microservice 1 running on EC2 instances in an Auto Scaling Group, backed by Amazon Aurora (handles traffic of cloudwithraj.com/browse)
- /buy → Target Group 2: Microservice 2 running on Lambda, backed by Amazon DynamoDB (handles traffic of cloudwithraj.com/buy)
- /* (catch-all) → Target Group 3: Microservice 3 running on EKS, backed by Amazon Aurora (handles traffic of anything else)
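The path-based routing above can be sketched in a few lines of Python. This is an illustrative stand-in for the ALB's rule evaluation, not an AWS API; the rule and target-group names are made up:

```python
# Minimal sketch of ALB-style path-based routing: rules are checked
# in order, and "/*" acts as the catch-all rule.
def route(path, rules):
    for prefix, target_group in rules:
        if prefix == "/*" or path.startswith(prefix):
            return target_group
    return None

RULES = [
    ("/browse", "target-group-1"),  # microservice 1
    ("/buy", "target-group-2"),     # microservice 2
    ("/*", "target-group-3"),       # catch-all microservice 3
]

print(route("/browse/books", RULES))  # target-group-1
print(route("/checkout", RULES))      # target-group-3
```

Each target group can then scale its backing compute (EC2, Lambda, or EKS) independently of the others.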
Page 4
Event Driven Architecture (Basic)
API Gateway
SQS
Lambda
An event-driven architecture decouples the producer and the processor. In this
example, the producer (a human) invokes an API and sends information in a JSON
payload. API Gateway puts it into an event store (SQS), and the processor
(Lambda) picks it up and processes it. Note that API Gateway and Lambda
can scale (and be managed/deployed) independently.
Benefits of an event-driven architecture
1. Scale and fail independently - By decoupling your services, they are only
aware of the event router, not each other. This means that your services are
interoperable, but if one service has a failure, the rest will keep running. The
event router acts as an elastic buffer that will accommodate surges in
workloads.
2. Develop with agility - You no longer need to write custom code to poll, filter,
and route events; the event router will automatically filter and push events to
consumers. The router also removes the need for heavy coordination between
producer and consumer services, speeding up your development process.
3. Audit with ease - An event router acts as a centralized location to audit your
application and define policies. These policies can restrict who can publish and
subscribe to a router and control which users and resources have permission to
access your data. You can also encrypt your events both in transit and at rest.
4. Cut costs - Event-driven architectures are push-based, so everything
happens on demand as the event presents itself in the router. This way, you're
not paying for continuous polling to check for an event. This means less network
bandwidth consumption, lower CPU utilization, less idle fleet capacity, and fewer
SSL/TLS handshakes.
Event Store
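The decoupling described above can be sketched with an in-memory queue standing in for SQS. This is a toy illustration of the flow, not the AWS APIs:

```python
# The producer only knows the event store; the consumer drains it at
# its own pace. Queue stands in for SQS; the functions stand in for
# API Gateway and Lambda.
from queue import Queue

event_store = Queue()  # stands in for SQS

def api_gateway(payload):
    event_store.put(payload)          # producer returns immediately
    return {"status": "accepted"}     # 202-style acknowledgement

def lambda_processor():
    processed = []
    while not event_store.empty():    # consumer polls independently
        processed.append(event_store.get())
    return processed

api_gateway({"order": 1})
api_gateway({"order": 2})
print(lambda_processor())  # [{'order': 1}, {'order': 2}]
```

Notice the producer never calls the processor directly, which is exactly why they can scale and fail independently.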
Page 5
Lambda
Database
Amazon DynamoDB
Microservice
Microservice Vs Event Driven
API Gateway
SQS
Lambda
Event Store
/buy (POST)
Domain: cloudwithraj.com
(processes messages from SQS for
cloudwithraj.com/buy (POST))
Database
Amazon DynamoDB
Websocket
API
Microservice with Event Driven
API Gateway
/buy (POST)
(Handles traffic of
cloudwithraj.com/buy (POST))
The main differences are:
1. A traditional microservice is synchronous, i.e., the request and response happen
within the same invocation. With event-driven, the user gets a confirmation
that the message was inserted into SQS, but does not get the response from the
actual message processing by the Lambda in the same invocation. Instead, the
backend Lambda needs to send the response out, in this case using WebSocket APIs,
to the user. Or the user can query the status afterwards.
2. With EDA, API Gateway and Lambda/database can scale independently.
Lambda can consume messages at a rate that does not overwhelm the database.
3. With EDA, retries are built in. With microservices, if the Lambda fails, the user
needs to send the request again. With EDA, once the message is in SQS, even if the
Lambda fails, SQS will automatically retry.
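The built-in retry behavior in point 3 can be sketched with a plain deque standing in for SQS. This is a conceptual simulation, not SQS's actual visibility-timeout mechanics:

```python
# Sketch of SQS-style automatic retry: a failed message goes back on
# the queue and is redelivered, without the user resending it.
from collections import deque

queue = deque([{"id": "m1", "attempts": 0}])
results = []

def flaky_lambda(msg):
    msg["attempts"] += 1
    if msg["attempts"] < 3:          # fail the first two deliveries
        raise RuntimeError("transient failure")
    return f"processed {msg['id']}"

while queue:
    msg = queue.popleft()
    try:
        results.append(flaky_lambda(msg))
    except RuntimeError:
        queue.append(msg)            # message returns to the queue

print(results)  # ['processed m1']
```

In real SQS the redelivery happens via the visibility timeout, and a dead-letter queue catches messages that keep failing.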
Page 6
Event Driven Architecture (Advanced)
API Gateway
SQS 1
Lambda 1
Based on values in the message, EventBridge
can fire different targets
Event Store + Router
EventBridge
Step Function
Lambda 2
Rule 1
Rule 2
Rule 3
SQS 2
SQS 3
SNS
Lambda 1
Lambda 2
EKS
Application
Destination
Filter 1
Destination
Filter 2
Destination
Filter 3
Based on values in the message, SNS can fire
different targets
SNS Vs SQS Vs EventBridge Detailed Video:
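The rule-based fan-out above can be sketched as dictionary pattern matching. This is a toy stand-in for EventBridge rule patterns, not the EventBridge API; rule names and targets are made up:

```python
# Toy EventBridge-style routing: each rule holds a pattern, and one
# event fans out to every target whose pattern matches its fields.
RULES = {
    "rule1": ({"type": "order"}, "step-function"),
    "rule2": ({"type": "refund"}, "sqs-2"),
    "rule3": ({"type": "order", "priority": "high"}, "sns"),
}

def matches(pattern, event):
    return all(event.get(k) == v for k, v in pattern.items())

def fire(event):
    return [target for pattern, target in RULES.values() if matches(pattern, event)]

print(fire({"type": "order", "priority": "high"}))  # ['step-function', 'sns']
print(fire({"type": "refund"}))                     # ['sqs-2']
```

SNS filter policies work on the same idea, except the patterns live on each subscription instead of on central rules.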
Page 7
3 Tier Architecture
EC2
Webserver
External Facing ALB
Internal ALB
Database
Amazon Aurora
EC2
Webserver
EC2
Appserver
EC2
Appserver
Auto Scaling Group
Auto Scaling Group
Availability Zone 1
Availability Zone 1
Availability Zone 2
Availability Zone 2
PRESENTATION LAYER
APPLICATION LAYER
DATABASE
1. The first layer is the presentation layer. Customers consume the application using this layer.
Generally, this is where the front end runs, for example, the amazon.com website. This is
implemented using an external-facing load balancer distributing traffic to VMs (EC2 instances)
running a webserver.
2. The second layer is the application layer. This is where the business logic resides. Going with
the previous example: you browsed products on amazon.com, found the product you like, and
clicked "add to cart". The flow comes to the application layer, which validates availability and
then creates a cart. This layer is implemented with an internal-facing load balancer and VMs
running applications.
3. The last layer is the database layer. This is where information is stored: all the product
information, your shopping cart, order history, etc. The application layer interacts with this layer
for CRUD (Create, Read, Update, Delete) operations. This could be implemented using one or a
mix of databases: SQL (e.g., Amazon Aurora) and/or NoSQL (DynamoDB).
Lastly, why is this so popular in interviews? This architecture comprises many critical
patterns: microservices, load balancing, scaling, performance optimization, high availability,
and more. Based on your answers, the interviewer can dig deep and check your understanding of
the core concepts.
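The three layers above can be sketched as plain functions calling down the stack. This is a conceptual sketch with a dict standing in for the database; the SKU names are made up:

```python
# Presentation layer calls the application layer, which alone talks
# to the database layer; each tier only knows the tier below it.
DATABASE = {"sku-1": {"stock": 3}}           # database layer

def add_to_cart(sku, cart):                  # application layer: business logic
    if DATABASE[sku]["stock"] > 0:           # validate availability
        DATABASE[sku]["stock"] -= 1
        cart.append(sku)
        return "added"
    return "out of stock"

def handle_click(sku, cart):                 # presentation layer: user action
    return add_to_cart(sku, cart)

cart = []
print(handle_click("sku-1", cart))  # added
print(cart)                         # ['sku-1']
```

The layering is what lets each tier scale and be secured independently, e.g. only the application tier gets database credentials.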
Page 8
How Many Data Centers in One Availability
Zone?
Correct Answer:
An AWS Availability Zone (AZ) can contain multiple data centers. Each zone is backed
by one or more physical data centers, with the largest backed by as many as five.
Incorrect Answer:
One Availability Zone means one data center
1 Availability Zone
Page 9
Is AWS Lambda Cheaper than Amazon EC2?
Incorrect Answer:
Yes, AWS Lambda is cheaper than Amazon EC2
Correct Answer:
It depends on the application. Both Lambda and EC2 have different cost factors (see above).
It is possible that, depending on the application, AWS Lambda can have a higher charge than
EC2, and vice versa. It is important to consider not just the compute cost but the TCO (Total
Cost of Ownership). With AWS Lambda there is no AMI to maintain, patch, and rehydrate,
reducing management overhead and hence overall TCO.
Lambda Cost Factors:
Architecture (x86 vs. Graviton)
Number of requests
Duration of each request
Amount of memory allocated (NOT used)
Amount of ephemeral storage allocated
EC2 Main Cost Factors:
Instance family
Attached EBS
Duration of EC2 runtime
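The trade-off can be made concrete with a back-of-envelope calculation. The rates below are illustrative placeholders, not current AWS pricing; the point is that the break-even depends on the traffic shape:

```python
# Lambda bills per request plus GB-seconds of allocated memory;
# EC2 bills for instance hours regardless of traffic.
def lambda_cost(requests, duration_s, memory_gb,
                per_million=0.20, per_gb_second=0.0000166667):
    return requests / 1e6 * per_million + requests * duration_s * memory_gb * per_gb_second

def ec2_cost(hours, hourly_rate=0.0416):   # e.g. one small instance
    return hours * hourly_rate

# Spiky, low-volume workload: Lambda is far cheaper than an always-on VM.
print(lambda_cost(100_000, 0.2, 0.5) < ec2_cost(24 * 30))     # True
# Sustained heavy workload: the always-on VM wins.
print(lambda_cost(50_000_000, 1.0, 1.0) > ec2_cost(24 * 30))  # True
```

This is why the honest interview answer is "it depends": run the numbers for your request volume, duration, and memory before choosing.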
Page 10
RPO Vs RTO
Disaster
Recovery Time (RTO)
< DOWNTIME >
< DATA LOSS >
Recovery Point (RPO)
Time
Candidates are sometimes confused by RPO, thinking it is measured in units of data, e.g.,
gigabytes or petabytes.
Correct Answer:
Both RPO and RTO are measured in time. RTO stands for Recovery Time Objective and is
a measure of how quickly after an outage an application must be available again. RPO, or
Recovery Point Objective, refers to how much data loss your application can tolerate.
Another way to think about RPO is how old the data can be when the application is
recovered, i.e., the time between the last backup and the disaster. With both RTO and
RPO, the targets are measured in hours, minutes, or seconds, with lower numbers
representing less downtime or less data loss. RPO and RTO can be, and often are,
different values for an application.
Most Recent Backup
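Because both objectives are durations, they fall straight out of three timestamps. The times below are illustrative:

```python
# RPO = disaster time - last backup time (worst-case data loss window)
# RTO = recovery time - disaster time   (downtime)
from datetime import datetime

last_backup = datetime(2024, 1, 1, 2, 0)
disaster    = datetime(2024, 1, 1, 3, 30)
recovered   = datetime(2024, 1, 1, 4, 0)

rpo = disaster - last_backup    # up to this much data is lost
rto = recovered - disaster      # the application was down this long

print(rpo)  # 1:30:00
print(rto)  # 0:30:00
```

Tightening RPO means backing up (or replicating) more often; tightening RTO means faster failover, and the two can be tuned independently.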
Page 11
IP Address Vs. URL
Bad Answer:
URL is a link assigned to an IP address
Virtual
Machine
(E.g. EC2)
IPAddress1
192.50.20.12
Virtual
Machine
(E.g. EC2)
IPAddress2
212.60.20.12
DNS
(Domain Name
System)
Load Balancer
Assigns URL to Load Balancer
(Uniform Resource Locator)
Access URL
Virtual
Machine
(E.g. EC2)
IPAddress1
250.80.10.12
(Went Down!!)
Correct Answer:
An IP address is a unique number that identifies a device connected to the internet, such as a
virtual machine running your application. However, accessing a resource using this unique
number is cumbersome; moreover, when a VM goes down (the bottom one in the
diagram), a new VM comes up to replace it with a different IP address. Hence, in reality, the
application running inside the VM is accessed using a URL, or Uniform Resource Locator.
One URL generally does NOT map to one IP address; rather, the URL (e.g., www.amazon.com) is
mapped to a load balancer, and that load balancer distributes traffic to multiple VMs with
different IP addresses. Even if one VM goes down and another comes up, accessing the load
balancer via the URL always works, because the load balancer distributes traffic across
healthy instances. This way, you (the user) do not need to worry about the underlying IP
addresses.
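The indirection can be sketched with two lookup tables. This is a conceptual sketch; the hostnames and IPs are made up, and real DNS and load balancers are of course far richer:

```python
# The URL resolves to a load balancer, never to one VM; the LB only
# forwards to healthy instances, so a dead VM is invisible to users.
DNS = {"www.example.com": "my-load-balancer"}   # hypothetical names

INSTANCES = {
    "192.50.20.12": "healthy",
    "212.60.20.12": "healthy",
    "250.80.10.12": "down",      # this VM just died
}

def resolve_and_route(url):
    lb = DNS[url]                # DNS hands back the LB, not a VM IP
    healthy = [ip for ip, state in INSTANCES.items() if state == "healthy"]
    return lb, healthy[0]        # LB picks a healthy target

print(resolve_and_route("www.example.com"))  # ('my-load-balancer', '192.50.20.12')
```

Swap the dead instance for a fresh one with a new IP and the function still returns a valid target, which is the whole point of the indirection.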
Page 12
DevOps CICD Phases with Tools
- Author: write code. Tools: VS Code
- Source: check in source code. Tools: AWS CodeCommit, GitHub
- Build: compile code, create artifacts. Tools: AWS CodeBuild, Jenkins
- Test: unit, integration, load, UI, and penetration testing. Tools: AWS CodeBuild, Jenkins
- Deploy: deploy artifacts. Tools: Jenkins, AWS CodeDeploy
- Monitor: logs, metrics, and traces. Tools: Amazon CloudWatch, AWS X-Ray

Continuous Integration (CI) covers the Source, Build, and Test phases; Continuous Deployment (CD) covers Deploy and Monitor.
Page 13
DevOps CICD Phases
- Author: write code. Tools: VS Code
- Source: check in source code. Tools: AWS CodeCommit, GitHub
- Build: compile code, create artifacts. Tools: AWS CodeBuild, Jenkins
- Test: unit, integration, load, UI, and penetration testing. Tools: AWS CodeBuild, Jenkins
- Deploy: deploy artifacts. Tools: Jenkins, AWS CodeDeploy
- Monitor: logs, metrics, and traces. Tools: Amazon CloudWatch, AWS X-Ray

Continuous Integration (CI) covers the Source, Build, and Test phases. Continuous Deployment (CD) pushes every passing build through Deploy automatically, whereas Continuous Delivery (CD) adds a Manual Approval gate before Deploy.
Page 14
Kubernetes Tools Ecosystem with AWS
Cloud Implementation
Amazon EKS sits at the center, with tools grouped by concern:
- Observability: Prometheus, Grafana, Fluent Bit, Jaeger, ADOT, CloudWatch, CloudWatch Container Insights, X-Ray
- Scaling: Karpenter, Cluster Autoscaler
- Delivery/Automation: Argo, Terraform, Jenkins, GitHub Actions, GitLab CICD
- Security: Gatekeeper, Trivy, ECR Scan, GuardDuty, kube-bench, Secrets Manager, Istio
- Cost Optimization: Kubecost, Cost and Usage Report (new feature: Split Cost Allocation)
Page 15
Code &
Dockerfile
Manifests
Git Repo
Amazon ECR
Container Image
CI Tool
CD Tool
CD Tool
Manifests
updated with
container image tag
1
2
Amazon EKS
Traditional CICD Vs. GitOps
Code &
Dockerfile
Manifests
Git Repo
Amazon ECR
Container Image
CI Tool
CD Tool
GitOps Tool Installed
in Cluster
Manifests
updated with
container image tag
A
B
Amazon EKS
Checks for
difference
between cluster
and Git
Pulls in
changed files
Traditional CICD
GitOps
3
C
Pushes files
Traditional DevOps
Step 1: Developers check in code, Dockerfile, and manifest YAMLs to an application repository. CI tools (e.g.,
Jenkins) kick off, build the container image, and save the image in a container registry such as Amazon ECR.
Step 2: CD tools (e.g., Jenkins) update the deployment manifest files with the tag of the container image.
Step 3: CD tools (e.g., Jenkins) execute the command to deploy the manifest files into the cluster, which, in
turn, deploys the newly built container in the Amazon EKS cluster.
Conclusion - Traditional CICD is a push-based model. If a sneaky SRE changes the YAML file directly in the
cluster (e.g., changes the number of replicas, or even the container image itself!), the resources running in the
cluster will deviate from what's defined in the YAML in Git. Worst case, this change can break something, and
the DevOps team needs to rerun part of the CICD process to push the intended YAMLs to the cluster.
GitOps
Step A: Developers check in code, Dockerfile, and manifest YAMLs to an application repository. CI tools (e.g.,
Jenkins) kick off, build the container image, and save the image in a container registry such as Amazon ECR.
Step B: CD tools (e.g., Jenkins) update the deployment manifest files with the tag of the container image.
Step C: With GitOps, Git becomes the single source of truth. You install a GitOps tool like Argo inside the
cluster and point it to a Git repo. The tool keeps checking whether there is a new file, or whether the files in
the cluster drift from the ones in Git. As soon as a YAML is updated with a new container image, there is a
drift between what's running in the cluster and what's in Git. ArgoCD pulls in the updated YAML file and
deploys the new container.
Conclusion - GitOps does NOT replace DevOps. As you can see, GitOps only replaces part of the CD process. In
the previous scenario where the sneaky SRE directly changes the YAML in the cluster, ArgoCD will detect the
mismatch between the changed file and the one in Git. Since there is a difference, it will pull in the file from
Git and bring the Kubernetes resources to their intended state. And don't worry, Argo can also send a message
to the sneaky SRE's manager ;).
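The pull-based reconciliation at the heart of GitOps can be sketched in a few lines. This is a toy loop in the spirit of tools like Argo CD, not their implementation; the state keys are made up:

```python
# Git is the source of truth; any drift in the cluster is overwritten
# by the reconciler on its next pass.
git_state     = {"replicas": 3, "image": "app:v2"}
cluster_state = {"replicas": 5, "image": "app:v2"}  # sneaky SRE edit

def reconcile(git, cluster):
    drift = {k: v for k, v in git.items() if cluster.get(k) != v}
    cluster.update(drift)          # pull Git's version into the cluster
    return drift

print(reconcile(git_state, cluster_state))  # {'replicas': 3}
print(cluster_state)                        # {'replicas': 3, 'image': 'app:v2'}
```

The same loop handles both cases from the text: a new image tag committed to Git, and an out-of-band edit made directly in the cluster.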
Page 16
Developer
Requests
Infrastructure
Ticketing
System
Platform Team
Infra as Code (IaC)
(Terraform, CDK etc.)
Code &
Dockerfile
Manifests
Git Repo
Amazon ECR
Container Image
CI Tool
CD Tool
CD Tool
Manifests
updated with
container image tag
Container deployed
1
2
4
5
6
3
Amazon EKS
Platform Team and Developer Team
Recently, the term "platform team" has been floating around plenty. But what does a platform team do? How
is it different from the developer team? Let's understand with the diagram:
Step 1: The developer team requests the platform team to provision appropriate AWS resources. In this example,
we are using Amazon EKS for the application, but this concept can be extended to any other AWS service. This
request for AWS resources is typically done via a ticketing system.
Step 2: The platform team receives the request.
Step 3: The platform team uses Infrastructure as Code (IaC), such as Terraform, CDK, etc., to provision the
requested AWS resources, and shares the credentials with the developer team.
Step 4: The developer team kicks off the CICD process. We are using a container workflow to understand the
flow. Developers check in code, Dockerfile, and manifest YAMLs to an application repository. CI tools (e.g.,
Jenkins, GitHub Actions) kick off, build the container image, and save the image in a container registry such as
Amazon ECR.
Step 5: CD tools (e.g., Jenkins, Spinnaker) update the deployment manifest files with the tag of the container
image.
Step 6: CD tools execute the command to deploy the manifest files into the cluster, which, in turn, deploys the
newly built container in the Amazon EKS cluster.
Conclusion - The platform team takes care of the infrastructure (often with guardrails) appropriate for the
organization, and the developer team uses that infrastructure to deploy their application. The platform team
does the upgrade and maintenance of the infrastructure to reduce the burden on the developer team.
Page 17
Scaling Difference Between Lambda, EC2, EKS
EC2
Auto Scaling Group
Lambda
EKS
EC2
EC2
Lambda
Lambda
EC2
Auto Scaling Group
EC2
EC2
EC2 Scaling: You need to use an Auto Scaling Group (ASG) and define on
what EC2 metric you want it to scale, e.g., CPU utilization. You can use the
ASG's "minimum number of instances" to run a certain number of instances
at all times. ASG also supports scheduled scaling and warm pools.
Lambda Scaling: No ASG needed. For each incoming connection, Lambda
automatically scales. Consider increasing the concurrency setting for
Lambda as needed. Implement Provisioned Concurrency to keep a certain
number of Lambda execution environments pre-warmed. This can be done either
on a schedule or based on Provisioned Concurrency utilization.
EKS Scaling: EKS scaling is the most complex. You may or may
NOT use an Auto Scaling Group, but it does NOT work like regular EC2
scaling. Please refer to the next page to learn about EKS
(Kubernetes/K8s) scaling in detail.
Page 18
1
Worker VM
Worker VM
Worker VM
Worker VM
Worker VM
Node Autoscaler
Pending
Unschedulable
2
3
4
How Does Kubernetes Worker Nodes Scale?
Correct Answer:
Step 1: You configure HPA (Horizontal Pod Autoscaler) to increase the replicas of your pods at a
certain CPU/memory/custom-metric threshold.
Step 2: As traffic increases and the pod metric utilization crosses the threshold, HPA
increases the number of pods. If there is capacity in the existing worker VMs, then the
Kubernetes kube-scheduler binds those pods to the running VMs.
Step 3: Traffic keeps increasing, and HPA increases the number of replicas of the pod. But
now there is no capacity left in the running VMs, so the kube-scheduler can't schedule the
pod (yet!). That pod goes into a pending, unschedulable state.
Step 4: As soon as pod(s) go into the pending, unschedulable state, Kubernetes node scalers (such as
Cluster Autoscaler, Karpenter, etc.) provision a new node. Cluster Autoscaler requires an Auto
Scaling Group, where it increases the desired VM count, whereas Karpenter doesn't require an
Auto Scaling Group or Node Group. Once the new VM comes up, the kube-scheduler puts the
pending pod onto the new node.
Incorrect Answer:
Set Auto Scaling Groups to scale at a certain VM metric utilization, like scaling regular VMs.
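The HPA decision in Step 1 follows a simple formula documented in the Kubernetes HPA docs; the numbers below are illustrative:

```python
# HPA scaling rule:
# desiredReplicas = ceil(currentReplicas * currentMetric / targetMetric)
import math

def desired_replicas(current_replicas, current_metric, target_metric):
    return math.ceil(current_replicas * current_metric / target_metric)

# 4 pods averaging 90% CPU against a 60% target -> scale to 6 pods.
print(desired_replicas(4, 90, 60))  # 6
# If those extra pods don't fit on existing nodes, they go Pending,
# and a node autoscaler (Cluster Autoscaler / Karpenter) adds a VM.
```

Note the formula drives pod count only; node count changes indirectly, triggered by the pending pods as described in Steps 3 and 4.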
Page 19
Gen AI 4 Layers
Silicon Chips
(E.g. AMD, NVIDIA)
LLM Models
(E.g. Open AI, Anthropic,
DeepSeek)
Infrastructure Providers to
Host/Train LLMs
(E.g. AWS, Azure, GCP)
Applications
(E.g. Adobe Firefly, LLM
Chatbots)
HARD
EASY
Learning Curve
Opportunity For New Market Players
Most amount of jobs
(MLOps, LLM with
Kubernetes/Serverless,
Cloud LLM services etc.)
Gen AI hype is at an all-time high, and so is the confusion. What do you study, how do you think about it,
and where are the most jobs? These are the burning questions in our minds. Gen AI can be broken down into
the following four layers.
1. The bottom layer is the hardware layer, i.e., the silicon chips that can train the models. Examples: AMD,
NVIDIA.
2. Then come the LLMs that get trained and run on the chips. Examples: OpenAI, Anthropic, etc.
3. Then come the infrastructure providers, who provide an easier way to consume, host, train, and run
inference on the models. An example is AWS. This layer consists of managed services such as Amazon Bedrock,
which hosts pre-trained models, or provisioned VMs (Amazon EC2) where you can train your own LLM.
4. Finally, we have the application layer, which uses those LLMs. Some examples are Adobe Firefly, LLM
chatbots, LLM travel agents, etc.
Now, the important part: as you go from the bottom to the top, the learning curve gets easier, and so does
the opportunity for new market players to enter. Building new chips requires billions of dollars of investment,
and hence it's harder for new players to enter the market. The most opportunities are in the top two
layers. If you already know the cloud, then integrating Gen AI with your existing knowledge will increase your
value immensely. If you are working in DevOps, learn MLOps; if you know K8s/Serverless, learn how you can
integrate Gen AI with those; if you work on an application, integrate managed LLM services to enhance
functionality. You get the idea!
Page 20
Prompt Engineering Vs RAG Vs Fine Tuning
Amazon Bedrock
(Hosts LLM)
Prompt Engineering
Prompt
Subpar Response
Enhanced Prompt
Better Response
1. You send a prompt to the LLM (hosted in Amazon Bedrock in this case) and
get a response you are not satisfied with.
2. You enhance the prompt, and finally come up with a prompt that gives the
desired, better response.
1
2
Prompt that can be
enhanced by company data
1
RAG (Retrieval Augmented Generation)
Code/Jupyter Notebook/App
Company data
Embeddings
Vector Database
Search Vector DB with
the prompt
Retrieve relevant info
related to prompt
Amazon Bedrock
(Hosts LLM)
Augment original prompt
with retrieved info
Generated answer
Generated answer
2
3
4
5
1. RAG (Retrieval Augmented Generation) is used where the response can be
made better by using company-specific data that the LLM does NOT have. You
store relevant company data in a vector database. This is done by a process
called embedding, where data is transformed into numeric vectors.
2. The user gives a prompt which can be made better by adding company-specific
info.
3. A process (code/Jupyter notebook/application) converts the prompt into a vector
and then searches the vector database. Relevant info from the vector database is
RETRIEVED (first part of RAG) and returned.
4. The original prompt is AUGMENTED (second part of RAG) with this company-
specific info and sent to the LLM.
5. The LLM GENERATES (last part of RAG) the response and sends it back to the
user.
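The RETRIEVE and AUGMENT steps can be sketched with hand-made toy vectors. Real systems get the vectors from an embedding model and use a vector database; here the two-dimensional embeddings and document texts are invented for illustration:

```python
# Toy RAG retrieval: cosine similarity picks the document closest to
# the prompt's vector, and that document augments the prompt.
import math

DOCS = {                      # pretend an embedding model produced these
    "refund policy: 30 days": [0.9, 0.1],
    "shipping takes 5 days":  [0.1, 0.9],
}

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def retrieve(prompt_vec):
    return max(DOCS, key=lambda d: cosine(DOCS[d], prompt_vec))

prompt_vec = [0.85, 0.2]                 # "what is the refund window?"
context = retrieve(prompt_vec)           # RETRIEVE
augmented = f"Context: {context}\nQuestion: what is the refund window?"  # AUGMENT
print(context)  # refund policy: 30 days
```

The augmented string is what actually reaches the LLM, which then GENERATES the answer with the company data in front of it.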
Fine Tuning
Base LLM
Fine Tuned LLM
Task-specific
training dataset
Prompt for
organization use case
Response
1. If you need an LLM that is very specific to your company/organization's use
case, and RAG can't solve it, you train the base LLM with a large, task-specific
training dataset. The output is a fine-tuned LLM.
2. The user asks a question related to the use case and gets the answer.
Page 21
Prompt that can be
enhanced by company data
1
Code/Jupyter Notebook/App
Company data
Embeddings
Bedrock Knowledge Base
Search Vector DB with
the prompt
Retrieve relevant info
related to prompt
Amazon Bedrock LLMs
Augment original prompt
with retrieved info
Generated answer
Generated answer
2
3
4
5
RAG (Retrieval Augmented Generation) is used where the response can be made better by using
company-specific data that the LLM does NOT have. Amazon Bedrock makes it very easy to do
RAG. Below are the steps:
1. You store relevant company data in an S3 bucket. Then, from Bedrock Knowledge Bases, you
select an embedding LLM (Amazon Titan Embed or Cohere Embed), which converts the S3 data into
embeddings (vectors). Knowledge Bases can also create a serverless vector store for you to save
those embeddings. Alternatively, you can bring your own vector database (OpenSearch,
Aurora, Pinecone, Redis).
2. The user gives a prompt which can be made better by adding company-specific info.
3. A process (code/Jupyter notebook/application) converts the prompt into a vector and then searches
the vector database. Relevant info from the vector database is RETRIEVED (first part of RAG)
and returned.
4. The original prompt is AUGMENTED (second part of RAG) with this company-specific info and
sent to another Bedrock LLM.
5. The Bedrock LLM GENERATES (last part of RAG) the response and sends it back to the user.
S3
OpenSearch
Serverless
(Vector Store)
OR
(BYO Vector Store)
Bedrock LLM to
convert data to
embedding
RAG (Retrieval Augmented Generation) with Amazon Bedrock
Page 22
EKS Upgrade Simplified Using Karpenter
EKS Control Plane
EC2
(EKS-Optimized AMI
for v1.27)
EC2
(EKS-Optimized AMI
for v1.27)
EKS Data Plane
Systems Manager
Parameter Store
(EKS-Optimized AMI list
for all EKS versions)
Karpenter
(Running in Data Plane)
Gets latest AMI for v1.27
EKS Version 1.27
Provisions EC2s with latest
AMI for v1.27
EKS Control Plane
EC2
(EKS-Optimized AMI
for v1.28)
EC2
(EKS-Optimized AMI
for v1.28)
EKS Data Plane
Systems Manager
Parameter Store
(EKS-Optimized AMI list
for all EKS versions)
Karpenter
(Running in Data Plane)
Gets latest AMI for v1.28
EKS Version 1.28
Recycles Worker Nodes
with v1.28 AMI in Rolling
Deployment Fashion
Updates Control Plane to
v1.28
New EKS Version Released
1
2
3
EKS upgrades can be tedious, but Karpenter can automatically upgrade your data plane worker nodes, reducing
your burden. Here's how:
a. EKS-Optimized AMI IDs are listed in the AWS Systems Manager Parameter Store. The CNCF project Karpenter,
the next-gen cluster autoscaler, periodically checks this list and reconciles with the running worker nodes to
see if they are running the latest EKS-Optimized AMI for the particular EKS version. In this case, let's assume
EKS is running v1.27.
b. At a certain point, EKS releases the next version, 1.28, and the below workflow takes place:
1. The admin upgrades the EKS control plane to v1.28.
2. Following the previous logic, Karpenter retrieves the latest AMI for v1.28 and checks if the worker nodes
are running it. They are NOT, so a Karpenter Drift is triggered.
3. To fix this drift, Karpenter automatically updates the worker nodes to v1.28 AMIs, and it does so using a
rolling deployment (i.e., a new node comes up; the existing node is cordoned, drained, then terminated). It
also respects the Kubernetes eviction API parameters, such as maintaining PDBs (Pod Disruption Budgets).
If you want to know the process in detail, including with custom AMIs, check out the Karpenter drift blog -
https://aws.amazon.com/blogs/containers/how-to-upgrade-amazon-eks-worker-nodes-with-karpenter-drift/
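The drift check in step 2 boils down to comparing each node's AMI against the latest published one. This is a conceptual sketch of that comparison, not Karpenter's code; the AMI names and node names are made up:

```python
# Karpenter-style drift check: a node is "drifted" when its AMI no
# longer matches the latest AMI for the control plane's version.
LATEST_AMI = {"1.27": "ami-127-final", "1.28": "ami-128-latest"}

def drifted_nodes(control_plane_version, nodes):
    desired = LATEST_AMI[control_plane_version]
    return [name for name, ami in nodes.items() if ami != desired]

nodes = {"node-a": "ami-127-final", "node-b": "ami-127-final"}
print(drifted_nodes("1.27", nodes))  # []  -> no drift, nothing to do
print(drifted_nodes("1.28", nodes))  # ['node-a', 'node-b'] -> recycle both
```

Everything returned by the check gets recycled via the rolling deployment described in step 3.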
Page 23
EKS Upgrade Using Karpenter - Automatic and Manual
EKS Control Plane
EC2
(EKS-Optimized AMI
for v1.28)
EC2
(EKS-Optimized AMI
for v1.28)
EKS Data Plane
Systems Manager
Parameter Store
(EKS-Optimized AMI list
for all EKS versions)
Karpenter
(Running in Data Plane)
Gets latest AMI for v1.28
EKS Version 1.28
Recycles Worker Nodes
with v1.28 AMI in Rolling
Deployment Fashion
Updates Control Plane to
v1.28 from v1.27
1
2
3
Karpenter can automatically upgrade your K8s worker nodes. What if you don't want it to? That's possible
too. Here's how:
You can pin your worker nodes to a specific version of the AMI using the Karpenter EC2NodeClass. For example,
in the below diagram, Karpenter will provision the worker nodes with the Amazon EKS-Optimized Bottlerocket
AMI version v1.20.5.
Even after AWS releases a new AMI or you upgrade your EKS control plane, Karpenter will NOT upgrade the data
plane. Remember that the EKS data plane can run with AMIs up to 3 versions behind the control plane, though it
is recommended to run with the latest version due to security patches.
The general practice I see for critical projects with my customers is that they let Karpenter automatically
upgrade to the latest AMI version in dev/test and pin to a specific AMI in prod. Once the newest version is
tested in dev/test, the production EC2NodeClass is changed to point to this new version. Once the EC2NodeClass
is updated, Karpenter upgrades the worker node AMIs in a rolling deployment fashion.
Automatic
Manual (More Control)
EKS Control Plane
EC2
(EKS-Optimized
Bottlerocket AMI
v1.20.5)
EKS Data Plane
Karpenter
(Running in Data Plane)
EKS Version 1.28
Worker nodes are pinned
to specific version and
NOT auto-upgraded
1
apiVersion: karpenter.k8s.aws/v1
kind: EC2NodeClass
metadata:
  name: default
spec:
  amiSelectorTerms:
    - alias: bottlerocket@latest

apiVersion: karpenter.k8s.aws/v1
kind: EC2NodeClass
metadata:
  name: default
spec:
  amiSelectorTerms:
    - alias: bottlerocket@v1.20.5
The "alias: bottlerocket@latest" in EC2NodeClass ensures Karpenter will always automatically upgrade
your worker nodes if AWS
releases a new AMI OR you upgrade your EKS Control Plane
EC2
(EKS-Optimized
Bottlerocket AMI
v1.20.5)
Page 24
Container Lifecycle - Local to Cloud
Developer
Code &
Dockerfile
Manifest
with image url
(In Local Machine)
Amazon ECR
Container Image
Amazon EKS
Test Container in Local
Machine
docker build
docker run
docker push
kubectl apply
The fundamental container workflow from local machine to cloud is below:
1. The developer writes code and an associated Dockerfile to containerize the code on her local machine.
2. She uses the "docker build" command to create the container image on her local machine. At this point,
the container image is saved on the local machine.
3. The developer uses the "docker run" command to run the container image and test the code running from
the container. The developer can repeat steps 1-3 until testing goes as per the requirements.
4. Next, the developer runs the "docker push" command to push the container image from the local machine to
a container registry. Some examples are Docker Hub and Amazon ECR.
5. Finally, using the "kubectl apply" command, a YAML manifest with the URL of the container image in
Amazon ECR is deployed into the running Kubernetes cluster.
Note that this is to understand the lifecycle of the container. In the real world, after testing is done on
the local machine, the following steps are automated. Refer to the "Traditional CICD vs GitOps" page for that
workflow.
Page 25
API Gateway
Lambda 1
Based on values in
the message,
EventBridge can fire
different targets
Event Store + Router
EventBridge
Step Function
ECS Fargate
Rule 1
Rule 2
Rule 3
Gen AI Multi Model Invocation
Sagemaker
Jumpstart
Bedrock
Bedrock
Invokes
LLM A
Invokes
LLM B
Invokes
LLM C
Page 26
Git Workflow
In the local IDE (e.g., Visual Studio Code), "git add" moves a changed file (file1) from the working
directory to the index/staging area. "git commit" records the staged file in the local repository
(which works like a database). "git push" uploads the committed changes from the local repository to
the remote repository (e.g., GitHub, GitLab).
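The flow in the diagram maps to these commands. A minimal sketch; the repo name, file name, and identity are illustrative:

```shell
# Minimal local Git flow: working directory -> staging area -> local repository
mkdir -p demo-repo
git -C demo-repo init -q                      # create the local repository (the "database")
git -C demo-repo config user.email "dev@example.com"
git -C demo-repo config user.name "Dev"
echo "hello" > demo-repo/file1                # edit file1 in the working directory (local IDE)
git -C demo-repo add file1                    # stage: working directory -> index/staging area
git -C demo-repo commit -q -m "Add file1"     # commit: staging area -> local repository
git -C demo-repo log --oneline                # the commit now lives in the local repo
# git -C demo-repo push origin main           # would upload commits to a remote (GitHub/GitLab)
```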
Page 27
DevSecOps Workflow
Author
Source
Build
Test
Deploy
Monitor
Use AWS Secrets Manager
How is the app exposed?
Use private subnet, AWS WAF,
AWS Shield, AuthN/Z
Static code analysis
Sample tools -
SonarQube, graudit
Lint Infra as Code
VS Code
AWS CodeCommit
GitHub
AWS CodeBuild
Jenkins
AWS CodeBuild
Jenkins
Jenkins AWS CodeDeploy
Amazon
CloudWatch
AWS X-Ray
Security Embedded Throughout The Pipeline
Penetration testing
DDoS Testing
Fault injection
simulation
Dynamic testing and analysis
Sample tools - Astra, Invicti
Monitor host
Page 28
What Happens When You Type a URL
Types www.amazon.com
Checks local
machine
caches for
IP Address
of the URL
If NOT cached
DNS root name server
Name server for .com TLD
Amazon Route 53
Get IP
address (LB
of URL
frontend)
DNS Resolver
EC2 Hosting Amazon.com
EC2 Hosting Amazon.com
Obtained IP address
Front end pages
1
2
3
4
5
Load Balancer (LB)
6
Page rendered and displayed to
user
Page 29
The operational excellence pillar focuses
on running and monitoring systems, and
continually improving processes and
procedures. Key topics include
automating changes, responding to
events, and defining standards to
manage daily operations.
Well Architected Framework
Operational Excellence Pillar
Security Pillar
Reliability Pillar
Performance Efficiency Pillar
Cost Optimization Pillar
Sustainability Pillar
The security pillar focuses on protecting
information and systems. Key topics
include confidentiality and integrity of
data, managing user permissions, and
establishing controls to detect security
events.
The reliability pillar focuses on
workloads performing their intended
functions and how to recover quickly
from failure to meet demands. Key
topics include distributed system
design, recovery planning, and adapting
to changing requirements.
The performance efficiency pillar
focuses on structured and streamlined
allocation of IT and computing
resources. Key topics include selecting
resource types and sizes optimized for
workload requirements, monitoring
performance, and maintaining efficiency
as business needs evolve.
The cost optimization pillar focuses on
avoiding unnecessary costs. Key topics
include understanding spending over
time and controlling fund allocation,
selecting resources of the right type
and quantity, and scaling to meet
business needs without overspending.
The sustainability pillar focuses on
minimizing the environmental impacts of
running cloud workloads. Key topics
include a shared responsibility model for
sustainability, understanding impact, and
maximizing utilization to minimize
required resources and reduce
downstream impacts.
The AWS Well-Architected Framework describes key concepts, design principles, and architectural best practices
for designing and running workloads in the cloud. By answering a few foundational questions, you learn how well your
architecture aligns with cloud best practices and gain guidance for making improvements. The Well-Architected
assessment questions can be answered inside your AWS account, producing a scorecard that evaluates your
application against the six pillars
Page 30
AWS Migration Tools
Oracle
Servers
Shared File System
AWS Database Migration
Service (DMS)
Amazon Aurora
AWS Application Migration
Service
Amazon EC2
AWS DataSync
Amazon S3
Amazon EFS
AWS Migration Hub
AWS Direct Connect
(or VPN or Internet)
AWS Application Discovery
Service
(Discovers on premises
applications)
Page 31
Multi-site Active Active
Availability Zone
VPC
Region A
Availability Zone
Route 53
Auto Scaling Group
App
server
Elastic Load Balancer
App
server
DynamoDB
DynamoDB
continuous backup
Availability Zone
VPC
Availability Zone
Elastic Load Balancer
DynamoDB
DynamoDB
continuous backup
Region B
DynamoDB global
table replication
AWS Cloud
Geolocation/Latency Routing
Auto Scaling Group
App
server
App
server
Page 32
Developer
Code &
Dockerfile
Manifest
with image url
(In Local Machine)
Amazon ECR
Container Image
Amazon EKS
(Control Plane)
Kubernetes Node Pod Container Relationship
EKS Worker Node
Container
Stored
EKS Worker Node
1. The developer creates the container image, which gets deployed on Amazon EKS.
2. The container image runs inside a Kubernetes Pod. The Pod is the minimum deployable unit in Kubernetes;
a container can NOT run without being inside a pod
3. Pods run on EC2 worker nodes. Multiple pods can run on one EC2 instance.
4. An easy way to remember this is "NPC" (like in video games) = Node - Pod - Container, in that order:
a node hosts one or many pods, and a pod hosts one or many containers
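The N-P-C relationship shows up directly in a Pod manifest. A minimal sketch; the pod name and image URL are illustrative:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: demo-app            # the pod: minimum deployable unit
spec:
  containers:               # one or many containers live inside the pod
    - name: app
      image: 123456789012.dkr.ecr.us-east-1.amazonaws.com/demo:latest
```

Running `kubectl get pod demo-app -o wide` prints a NODE column showing which worker node hosts the pod, completing the Node - Pod - Container chain.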
Page 33
Batch Workloads on AWS
AWS Batch
Amazon EventBridge
Scheduler
Amazon ECS
Amazon EKS
Fargate
EC2
Amazon ECR
1. Use the EventBridge scheduler to schedule jobs. You can also trigger jobs based on EventBridge rules,
which gives you great flexibility and power!
2. The batch job is then submitted via AWS Batch. You may be thinking: I see a container image and then
ECS/EKS, so why can't I submit a container job directly from EventBridge without AWS Batch? Because
AWS Batch provides job queues, retries, resource management, etc. that you lose if you skip it.
3. The actual steps of the job need to be packaged in a container
4. The containerized job runs on either ECS or EKS. AWS Batch supports both EC2 and Fargate
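The container from step 3 is wrapped in an AWS Batch job definition. A hedged sketch for Fargate; the job name, image URL, role ARN, and sizes are illustrative:

```json
{
  "jobDefinitionName": "nightly-report",
  "type": "container",
  "platformCapabilities": ["FARGATE"],
  "containerProperties": {
    "image": "123456789012.dkr.ecr.us-east-1.amazonaws.com/report-job:latest",
    "resourceRequirements": [
      { "type": "VCPU", "value": "1" },
      { "type": "MEMORY", "value": "2048" }
    ],
    "executionRoleArn": "arn:aws:iam::123456789012:role/BatchExecutionRole"
  }
}
```

This would be registered with `aws batch register-job-definition --cli-input-json file://jobdef.json`, and EventBridge then submits jobs against it.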
Page 34
STAR Interview
S
T
A
R
SITUATION
background of the
project.
TASK
goals that you need
to achieve
ACTION
steps that you took
RESULT
what you have achieved
One of the biggest mistakes people make in behavioral interviews is constantly saying "we" - "we did this task",
"we came up with a plan", "we completed step X". This may sound odd - at Amazon we expect you to be a team
player, but in the interview we want to know precisely what YOU did. Think of it as player selection for your
Super Bowl team. Yes, we want the receiver to play well with the team, but while selecting, we only care about
his stats and his abilities. So make sure to clarify what part YOU played in your answers.
The next biggest mistake people make is talking in hypotheticals. When a question starts with "Tell me about a
time when you..", you must give an example from your past projects. If you answer only in hypotheticals such as
"I would do this, that..", you will fail the interview.
Okay, now let's look at a sample question and answer.
Q: Tell me about one difficult project that you delivered. What was the difficulty, and how did you determine
the course of action? What was the result?
A: I migrated our project to AWS Serverless after considering K8s and EC2. We coded the Lambda, tested it,
implemented DevOps, then deployed into prod, and it was a huge success.
Is the above answer good or bad? It's quite bad. Why?
• Situation and Task not described
• "We" - what actions did YOU perform?
• "Huge Success" - no data, very subjective
A good answer may look like this:
Situation - We had 20 microservices running on-prem on PCF. The PCF license needed to be renewed in 6 months,
and leadership wanted the project migrated to AWS before that to save cost and increase agility.
Task - As the lead architect/developer/tech lead, I was tasked with finding a suitable AWS solution within the
given timeframe.
Action - I researched possible ways to run microservices on AWS. I narrowed it down to three options - run each
microservice on vanilla EC2, run on K8s using EKS, or go Serverless. I took one of the microservices and did a
POC on vanilla EC2, EKS, and Lambda-API Gateway. While they all did the job, I found that with EC2 I had to make
it HA by spinning up multiple EC2s in multiple AZs, plus the overhead of AMI rehydration. EKS seemed a valid
solution; however, given our traffic patterns, we would pay more than necessary, and there is the overhead of
training the team on K8s. Lambda-API Gateway is inherently HA, scalable, pay-for-what-you-use, with no servers to
manage at all. This simplifies our day-2 operational overhead and lets us focus on delivering business value.
Result - Based on the POC data on performance, cost, and time to deploy, I selected the Serverless solution. We
converted the rest of the microservices to Lambda and implemented them in production within 3 months. It resulted
in over 90% cost savings over EC2 and K8s. I shared my project learnings with other teams and showed them how to
code Lambda so they could utilize it as well. I was recognized by the CIO for this effort.
Why is this answer good?
• Situation, Task, Action, Results are clearly defined
• Gives details on what I did
• Result has data, and not just "huge success"
Note that you will get follow-up questions on the answer to understand your depth and to make sure you are not
just copying and pasting a generic answer from the Internet 😅.
Page 35
Hybrid Cloud Architecture
Hybrid means AWS and an on-premises data center working together to run the application
+
Site to Site
VPN Or Direct
Connect
On-Prem Databases
(Will stay on-premises till all
apps move to AWS, or perhaps
due to regulatory
requirements)
Amazon EC2
(Application code moved to
AWS)
ALB
Route 53
User
Page 36
Karpenter Consolidation Saving Money
1
Worker VM
2
Worker VM
Worker VM
Worker VM
Underutilized Nodes
kind: NodePool
spec:
  disruption:
    consolidationPolicy: WhenEmptyOrUnderutilized
Worker VM
Worker VM
Worker VM
Worker VM
Pods Consolidated (Binpacked)
Worker VM
Worker VM
Unused nodes terminated == Significant savings
3
4
Enable Consolidation at Karpenter NodePool YAML
Page 37
API Gateway Auth
API Key
Lambda
API Gateway
DynamoDB
Static key sent
on header
IAM
Lambda
API Gateway
DynamoDB
Access key
Secret access key
IAM
Validates IAM credentials
(Generate static key)
Cognito User Pool
Lambda
API Gateway
DynamoDB
Cognito
Userid, pwd
Exchange userid, pwd for
temporary JWT token
JWT token
Validates token
Cognito User Pool with Federated Identities
Lambda
API Gateway
DynamoDB
Cognito User Pool
Authenticate
Exchange third party userid, pwd
for temporary JWT token
Send IAM creds
Exchange token with
IAM creds
IAM
Cognito Federated
Identity
Validate IAM creds
Third Party Identity Provider (IdP)
Lambda
API Gateway
DynamoDB
Identity Provider
Exchange IdP userid, pwd for
temporary JWT token
Send token
IAM
Lambda Authorizer
Validates token with IdP,
issues IAM creds
Validate IAM creds
Total of 5 Auth Mechanisms
1
2
1
2
3
3
1
2
3
4
1
2
3
4
6
5
7
1
2
3
4
5
Page 38
Serverless Web Application
S3
(Static website - CSS, JS, images)
Lambda
API Gateway
DynamoDB
CloudFront
(Handles authentication
and authorization,
throttling, DDoS
protection, and more)
API invoked for
additional info (e.g. a
button is clicked on
the site)
Route 53
(Assign readable domain)
Page 39
EDA (Event Driven Architecture) with Kubernetes
SQS
Event Store
Application
Scales HPA based on number
of messages in queue
Pending Unschedulable Pods of
Message Processing App
Karpenter provisions nodes in
response to pending pods
Worker VM
Kubernetes Cluster
Karpenter can be used with other CNCF projects to deliver powerful solutions for common use cases. One prominent
example of this is using Kubernetes Event Driven Autoscaling (KEDA) with Karpenter to implement event-driven
workloads. With KEDA, you can drive the scaling of any container in Kubernetes based on the number of events
needing to be processed. One popular implementation is to scale up worker nodes to accommodate pods that process
messages coming into a queue:
1. The application inserts messages into the queue
2. KEDA monitors queue depth and scales the HPA for the application that processes the messages
3. The HPA increases the number of pods. Assuming no capacity is available to schedule those pods, Karpenter
provisions new nodes
4. Kube-scheduler places those pods on the VMs, and the pods process the messages from the queue
5. Once processing is done, the number of pods goes to zero. Karpenter can scale the VMs down to zero for maximum
cost efficiency
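Step 2 is configured with a KEDA ScaledObject. A hedged sketch; the queue URL, names, region, and thresholds are illustrative:

```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: sqs-consumer
spec:
  scaleTargetRef:
    name: message-processor   # the Deployment processing SQS messages
  minReplicaCount: 0          # scale to zero when the queue is empty
  maxReplicaCount: 50
  triggers:
    - type: aws-sqs-queue
      metadata:
        queueURL: https://sqs.us-east-1.amazonaws.com/123456789012/orders
        queueLength: "5"      # target messages per replica
        awsRegion: us-east-1
```

KEDA creates and drives the HPA from this object; Karpenter then reacts to any pods the HPA leaves pending.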
Page 40
EDA (Event Driven Architecture) with SNS SQS Kubernetes
SQS1
Event Store
Application
Scales HPA based on number
of messages in queue
Pending Unschedulable Pods of
Message Processing App
Karpenter provisions nodes in
response to pending pods
Worker VM
Kubernetes Cluster
SNS
Payload based
filtering &
routing
SQS2
Event Store
Lambda
DynamoDB
Page 41
EKS Auto Mode
Managed EC2
Control Plane
(Managed by AWS)
Karpenter
Ingress
Storage
AWS makes it Scalable, HA, Secure
Managed EC2
Managed EC2
Data Plane
Core DNS, Kube Proxy, CNI runs as
processes, baked into the AWS
managed AMI
- AWS manages core add-ons like Karpenter, Ingress, and EBS
- AWS makes them secure, scalable, and HA, just like the EKS control plane components
- AWS manages the versions
- AWS manages the AMIs
- The AMI is based on Bottlerocket with the CoreDNS, kube-proxy, and VPC CNI components baked in
Managed worker nodes are EC2 instances, which means you can run pretty much anything that you can
run on regular Kubernetes
This is a big one - EKS Auto supports DaemonSets, so you can run your favorite tools and agents
Use Reserved Instances, Savings Plans, Spot, and Graviton with EKS Auto
Amazon EKS
Page 42
Amazon Aurora Global Database Vs. DSQL
Availability Zone
VPC
Region A
Availability Zone
Route 53
Auto Scaling Group
Elastic Load Balancer
App Server
Amazon Aurora DSQL
(Multi-AZ)
Availability Zone
VPC
Availability Zone
Auto Scaling Group
Elastic Load Balancer
App Server
Region B
Both regions can
accept writes,
unlike Global
Database
AWS Cloud
Geolocation/Latency Routing
App Server
App Server
Amazon Aurora DSQL
(Multi-AZ)
Amazon Aurora released DSQL, which may seem similar to Aurora Global Database. What are the similarities and
differences?
1. Amazon Aurora, whether using Global Database or DSQL, stores data across multiple AZs within a single region,
so in the unlikely event of an AZ failure your data is highly available
2. The main difference: Global Database replicates data from one region to another, and only one region can write
to the table. The secondary region acts as a reader; only when the primary region fails does the secondary
region's table get promoted to the primary (writer) instance. DSQL, in contrast, is Active-Active, i.e. both
regions can accept writes and cross-replicate. DSQL also stores the transaction logs that can be used to recover
lost transactions during a disaster
3. Aurora Global Database supports MySQL and PostgreSQL, while the DSQL public preview is available for
PostgreSQL
4. Until now, Aurora could only autoscale Aurora Replicas for reads. DSQL can horizontally scale both reads and
writes (compute and storage), which is a big leap and a game changer for SQL databases
Page 43
Amazon S3 Tables with Apache Iceberg
AWS Glue
(S3 Table Catalog)
Automatic Catalog
Amazon EMR
Amazon Athena
Amazon Redshift
Amazon QuickSight
Amazon Data Firehose
S3 Tables
Analytics using various
services
Simple SQL Query
R/W via Apache Iceberg
Standard
Iceberg is a high-performance format for huge analytic tables. Iceberg brings the reliability and simplicity of
SQL tables to big data, while making it possible for engines like Spark, Trino, Flink, Presto, Hive, and Impala
to safely work with the same tables at the same time.
- Prior to this announcement, you could save data as Iceberg tables in a general-purpose S3 bucket, which carries
overhead (using the Data Catalog in Lake Formation)
- With this announcement, you can define tables easily, with much better performance
- Up to 10x higher TPS and 3x faster query performance compared to Iceberg data stored in general-purpose buckets
- Tables are automatically registered in the Glue Data Catalog
- The catalog can be used by Amazon EMR (Spark), Athena, Redshift, QuickSight, and Data Firehose
- Fully managed Apache Iceberg tables in S3 - each table gets an ARN (Amazon Resource Name)
- Optimized performance, security controls, and cost optimization
- Automatic compaction
- Run simple SQL queries
- Applications read and write data via the Apache Iceberg standard
Page 44
Docker Vs Kubernetes
Developer
Code &
Dockerfile
Manifest
with image url
(In Local Machine)
Amazon ECR
Docker Image
Test Container in Local
Machine
docker build
docker run
docker push
kubectl apply
Amazon EKS
Kubernetes
Files, libraries,
dependencies packaged
together
File with commands
to create docker
container
Kubernetes is a container orchestrator
Manages auto healing, scaling, networking
between multiple replicas of the running
docker container images
EC2 Worker Node
EC2 Worker Node
Docker Vs Kubernetes
1. The term "Docker" is used in a lot of places in the container lifecycle, so let's understand it with the flow
above
2. The two primary things the term "Docker" is associated with are the Dockerfile, and mostly the Docker image
(also known as the Docker container image, or just container image)
3. A Dockerfile is a file with commands that packages the code, libraries, and dependencies into the Docker
container image
4. This Docker image is a single copy of your application, often tested on your local machine (note the single
Docker container image on the laptop!). The image is stored in a registry like Amazon ECR
5. Finally, using the "kubectl apply" command, a YAML manifest containing the URL of the container image in
Amazon ECR is deployed into the running Kubernetes cluster
6. Running a single Docker container is easy. But when you need to run many copies (known as replicas in
container lingo) of your container image, there needs to be a control plane managing different things: when one
running container dies, provision another in its place; scale the replicas; handle networking between them; and
more. This is exactly what Kubernetes does
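Point 6 is what a Kubernetes Deployment expresses: a desired number of replicas that the control plane keeps healthy. A minimal sketch; the names and image URL are illustrative:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: demo-app
spec:
  replicas: 3               # Kubernetes keeps 3 copies running, replacing any that die
  selector:
    matchLabels:
      app: demo
  template:
    metadata:
      labels:
        app: demo
    spec:
      containers:
        - name: app
          image: 123456789012.dkr.ecr.us-east-1.amazonaws.com/demo:latest
```

Docker builds and runs the single image; this manifest is where orchestration (auto-healing, scaling) begins.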
Page 45
System Design Trade-Offs
Increased
Reliability
Price
Availability Zone
App
server
App
server
Availability Zone
Region A
Replicate infra in
another region or more
AZs - increases cost
Region A
Control
Management
Overhead
Lambda
Amazon EC2
AWS manages
underlying infra
Customer
manages VM
Lightsail
Kubernetes
Easy Way
EKS Auto
Container in
managed VM
...
...
Other
Other
Consistency
Performance
DynamoDB
Microseconds
response time
Amazon RDS
Follows ACID
properties
Page 46
Top 3 Popular Designs
Lambda
Database
Amazon DynamoDB
Synchronous Microservice
API Gateway
SQS
Lambda
Event Store
/buy (POST)
(processes messages from SQS for
cloudwithraj.com/buy (POST)) )
Database
Amazon DynamoDB
Websocket
API
Event Driven Architecture
API Gateway
/buy (POST)
(Handles traffic of
cloudwithraj.com/buy (POST))
Ingress
/buy
Target Group 1
/browse
Target Group 2
Database
Amazon Aurora
Path Based Routing
Domain:
cloudwithraj.com
Deployment for /buy
Deployment for /browse
Database
Amazon DynamoDB
ALB
(url: cloudwithraj.com)
Amazon EKS
Amazon EKS
Kubernetes Ingress
Route 53
Page 47
3 Tier Architecture with Microservice
EC2
Webserver
External Facing ALB
Internal ALB
Database
Amazon Aurora
EC2
Webserver
EC2
Appserver
EC2
Appserver
Auto Scaling Group
Auto Scaling Group
Availability Zone 1
Availability Zone 1
Availability Zone 2
Availability Zone 2
PRESENTATION LAYER
APPLICATION LAYER
DATABASE
EC2
(Running Code)
ALB
Database
Amazon Aurora
EC2
(Scaled Up)
Auto Scaling Group
/browse
Target Group 1
Lambda
/buy
Target Group 2
/* (Catch all)
Target Group 3
(Handles traffic of
cloudwithraj.com/browse)
(Handles traffic of
cloudwithraj.com/buy)
(Handles traffic of anything
else)
Database
Amazon DynamoDB
Database
Amazon Aurora
Microservice 1
Microservice 2
Microservice 3
EKS
Path Based Routing
Three-Tier Design
Detailed Microservice Design
Page 48
Developer
Code &
Dockerfile
Manifest
with image url
(In Local Machine)
Amazon ECR
App Container Image
Pod Container Sidecar
Container
Stored
App Container Image
10.15.25.215
App Container
Sidecar Container
Exposed at port 80
10.15.25.215:80
1. In Kubernetes, container(s) run inside a pod
2. Generally, one application container runs inside one pod
3. But sometimes another container runs inside the same pod along with the application container, called a
sidecar container. Sidecar containers have their own independent lifecycles: they can be started, stopped, and
restarted independently of the app container. This means you can update, scale, or maintain sidecar containers
without affecting the primary application
4. Each pod has an IP address, and all containers inside the pod share that IP address. For that reason, the
application container is exposed using a port. In this example the pod IP is 10.15.25.215, and the app container
is exposed at port 80
5. One popular example of a sidecar container is the Istio service mesh proxy, which runs inside each pod
alongside the app container
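A pod with a sidecar looks like this in YAML. A minimal sketch; the names and images are illustrative (a log-shipping sidecar is used here as an example):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: demo-app
spec:
  containers:
    - name: app                 # primary application container
      image: 123456789012.dkr.ecr.us-east-1.amazonaws.com/demo:latest
      ports:
        - containerPort: 80     # app reachable at <pod-ip>:80
    - name: log-shipper         # sidecar: shares the pod's IP and volumes
      image: fluent/fluent-bit:latest
```

Since Kubernetes 1.28, sidecars can also be declared as init containers with `restartPolicy: Always`, which is what gives them the independent lifecycle described above.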
Page 49
Control Karpenter Consolidation
1
Worker VM
2
Worker VM
Worker VM
Worker VM
Underutilized Nodes
kind: NodePool
spec:
  disruption:
    consolidationPolicy: WhenEmptyOrUnderutilized
    consolidateAfter: 30m
Worker VM
Worker VM
Worker VM
Worker VM
Pods Consolidated (Binpacked)
Worker VM
Worker VM
Unused nodes terminated == Significant savings
3
4
Enable consolidation in the Karpenter NodePool YAML with consolidateAfter
Set the value to 'Never' to stop consolidation
With 30m, Karpenter WAITS 30 minutes after the last pod is added or removed before consolidating
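Put together, an abbreviated NodePool sketch with this control (the name is illustrative, and a full NodePool also carries a node template spec):

```yaml
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: default
spec:
  disruption:
    consolidationPolicy: WhenEmptyOrUnderutilized
    # wait 30m after the last pod is scheduled or removed before consolidating;
    # set to Never to disable consolidation entirely
    consolidateAfter: 30m
```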
Page 50
Cross-Account EventBridge Target
API Gateway
SQS B
Based on values
in the message,
EventBridge can
fire different
targets
EventBridge
Rule 1
Rule 2
AWS Account A
(Team A)
Lambda B
AWS Account B
(Team B)
AWS Account C
(Team C)
Lambda C
Page 51
Amazon EKS
Cloud Implementation
Observability
Scaling
Delivery/Automation
Security
Cost Optimization
Prometheus
Grafana
Fluentbit
Jaeger
ADOT
CloudWatch
Karpenter
AutoScaling
Argo
Terraform
Jenkins
Github Actions
Gitlab CICD
Gatekeeper
Trivy
ECR Scan
GuardDuty
Kube Bench
Secrets
Manager
Istio
CloudWatch
Container
Insights
Split Cost
Allocation
Kubecost
X-Ray
Kubernetes Tech Stack
Page 52
API Gateway, Lambda, AWS Service
DynamoDB
Use Lambda to Transform NOT Transport
API Gateway
Lambda
Is Lambda inserting/reading data without
any business logic or data manipulation?
DynamoDB
API Gateway
API Gateway can interact with
many AWS services directly
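When Lambda would only transport data, an API Gateway AWS service integration can write to DynamoDB directly; the request mapping template (VTL) performs the transform. A hedged sketch for a PutItem integration, assuming a hypothetical "orders" table:

```json
{
  "TableName": "orders",
  "Item": {
    "orderId": { "S": "$context.requestId" },
    "payload": { "S": "$util.escapeJavaScript($input.body)" }
  }
}
```

This removes the "transport-only" Lambda from the path entirely: one less function to pay for, cold-start, and maintain.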
Page 53
Microservices Tech Stack
EC2
(Running Code)
ALB
(url: cloudwithraj.com)
Database
Amazon Aurora
EC2
(Scaled Up)
Auto Scaling Group
Domain: cloudwithraj.com
/browse
Target Group 1
Lambda
/buy
Target Group 2
/* (Catch all)
Target Group 3
(Handles traffic of
cloudwithraj.com/browse)
(Handles traffic of
cloudwithraj.com/buy)
(Handles traffic of anything
else)
Database
Amazon DynamoDB
Database
Amazon Aurora
Microservice 1
Microservice 2
Microservice 3
EKS
Path Based Routing
Amazon ECR
Container Image
Page 54
High Scale Live Event Streaming
AWS
Elemental
MediaLive
AWS
Elemental
MediaPackage
CloudFront
- Takes raw video input
- Transcode into
multiple bitrates
- Packages video in
multiple streaming
formats to support
various devices
- Global distribution
of the live feed
Devices
Live Cameras
in Stadium
S3
- Save for Video on
Demand
Page 55
Dynamic Ad Insertion in Live Stream
AWS
Elemental
MediaLive
AWS
Elemental
MediaPackage
CloudFront
- Takes raw
video input
- Transcode into
multiple bitrates
- Packages video
in multiple
streaming
formats to
support various
devices
Insert
Ads
Devices
Live Cameras
in Stadium
AWS
Elemental
MediaTailor
1
2
EC2
Running Ad Decision
Server (ADS)
3
4
5
6
ad markers
ad A
ad B
A
B
- Global distribution
of the live feed
1. CloudFront requests the manifest (video clip with ads) with viewer information, which is used for ad
personalization
2. MediaTailor gets the manifest, i.e. the video clip with ad markers (which time frames to insert the ads into).
At this point NO ads have been inserted
3. MediaTailor requests personalized ads based on viewer information from the ADS (Ad Decision Server). The ADS
is third-party software that can run on EC2
4. The ADS running on EC2 returns ad(s) to MediaTailor
5. MediaTailor inserts these ad(s) into the manifest at the ad markers
6. MediaTailor returns the manifest with ads to CloudFront. The clip with ads is shown to the viewers consuming
from CloudFront